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Abstract 

Changes in cis-regulatory element composition that result in novel patterns of gene expression are thought to be a major 
contributor to the evolution of lineage-specific traits. Although transcription factor binding events show substantial 
variation across species, most computational approaches to study regulatory elements focus primarily upon highly 
conserved sites, and rely heavily upon multiple sequence alignments. However, sequence conservation based approaches 
have limited ability to detect lineage-specific elements that could contribute to species-specific traits. In this paper, we 
describe a novel framework that utilizes a birth-death model to trace the evolution of lineage-specific binding sites without 
relying on detailed base-by-base cross-species alignments. Our model was applied to analyze the evolution of binding sites 
based on the ChlP-seq data for six transcription factors (GATA1, S0X2, CTCF, IVIYC, IVIAX, ETSl) along the lineage toward 
human after human-mouse common ancestor. We estimate that a substantial fraction of binding sites {~58-79% for each 
factor) in humans have origins since the divergence with mouse. Over 1 5% of all binding sites are unique to hominids. Such 
elements are often enriched near genes associated with specific pathways, and harbor more common SNPs than older 
binding sites in the human genome. These results support the ability of our method to identify lineage-specific regulatory 
elements and help understand their roles in shaping variation in gene regulation across species. 
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Introduction 

Changes in gene regulation play a key role in the evolution of 
morphological traits [1-3]. At the level of transcription, gene 
expression is controlled via transcription factor (TF) proteins that 
selectively bind to cis-regulatory elements in a sequence-specific 
manner [2,4]. Utilizing chromatin immunoprecipitation of specific 
TFs followed by high-throughput sequencing (ChlP-seq), recent 
studies showed that the evolution of these transcription factor 
binding sites (TFBS) is highly dynamic, with sites difFering a great 
deal even within mammals [5-9] . 

Despite substantial experimental evidence for rapid divergence 
of regulatory protein-binding events across species, computational 
models designed to analyze regulatory elements using cross-species 
comparisons have focused primarily upon 'phylogenetic footprint- 
ing' approaches, in which putatively functional regulatory 
elements are identified according to sequence conservation [10- 
1 5] . Previous computational studies have inferred the evolution of 
regulatory elements using, for example, the emergence of new 
conserved elements specific to a particular clade in the phylogeny 
[16] or lineage-specific alterations leading to a loss-of-function 
phenotype [17,18]. Although such approaches have been helpful 
in understanding lineage-specific regulatory element evolution, all 



inherentiy rely upon frxed cross-species alignments, which are 
frequently of low quality within non-coding regions in the genome 
[19-21]. Previous studies have estimated that more than 15% of 
aligned bases within human-mouse whole-genome alignments are 
incorrect [22] and the error rate increases when more species are 
involved [19]. Ancestral reconstruction, which is sensitive to details 
of the multiple alignment, is a particularly challenging problem for 
non-coding regions [23,24]. As a consequence, cross-species 
comparisons of non-coding sequences are limited in their ability 
to study regulatory sequence evolution, particularly in cases where 
the elements are selected for novelty or newly-derived. Such 
newly-derived regulatory elements are not rare; indeed, analyses 
using human population variation data from the 1000 Genomes 
Project [25] have shown that human genomic locations under 
selection undergo considerable turnover and frequentiy lie outside 
mammalian-conserved regions [26]. Yet, systematic identification 
of binding sites for specific TFs and assessment of their 
conservation and prevalence using cross-species comparisons 
remains a challenging problem. 

In this work, we introduce a novel evolutionary framework 
through which lineage-specific TFBSs can be inferred on a 
genome-wide scale. In contrast to conservation-based approaches 
[13,16,27], we utilize a birth-death model to infer ancestral states 
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Author Summary 

Recent experimental studies showed that the evolution of 
transcription factor binding sites (TFBS) is highly dynamic, 
with sites differing a great deal even between closely 
related mammalian species. Despite the substantial 
experimental evidence for rapid divergence of regulatory 
protein-binding events across species, computational 
methods designed to analyze regulatory elements evolu- 
tion have focused primarily on phylogenetic footprinting 
approaches, in which putative functional regulatory 
elements are identified according to strong sequence 
conservation. Cross-species comparisons of non-coding 
sequences are limited in their ability to fully understand 
the evolution of regulatory sequences, particularly in cases 
where the elements are selected for novelty or species- 
specific. We have developed a novel framework to 
reconstruct the history of lineage-specific TFBS and 
showed that large amount of TFBS in human were born 
after human-mouse divergence. These elements also have 
distinct biological implications as compared to more 
ancient ones. This method can help understand the roles 
of lineage-specific TFBS in shaping gene regulation across 
different species. 



of a given motif without the use of the base-by-base alignment 
details in the vmderlying cross-species sequence alignment. Gains 
and losses of TFBS have been exphcitly used both to improve 
cross-species sequence comparisons and to detect cis-regulatory 
modules, although such models are usually framed within the 
context of an alignment [21,28]. A more similar alignment-free 
model was previously used to measure the overall rate of TFBS 
creation along different lineages [29]. In this work, we instead 
applied our framework to infer lineage-specific TFBS, estimating 
the branch of origin of each individual TFBS for six TFs. We then 
studied patterns for TFBS with different branches of origin, 
including target genes of the newly-derived sites, relationship with 
within-human variation, overlap with transposable elements, and 
cell-type specificity of TF-binding. Our results provide strong 
support that this novel method can help identify lineage-specific 
regulatory elements, a first step towards understanding the role of 
regulatory element evolution in shaping the variation of gene 
regulation across species. 

Results 

Overview of the probabilistic frannework for the birth- 
death evolutionary model of TFBS 

Our goal is to detect lineage-specific rates of TFBS evolution 
and the branch of origin for individual TFBS. Here, lineage means 
any ancestral branch in the phylogeiiy or a branch leading toward 
any modern species. Our approach is to model TFBS evolution 
using a birth-death framework, in which individual TFBSs can be 
gained, lost, or conserved within a given lineage during evolution. 
The rate of TFBS creation (birth rate) and loss (death rate) are first 
estimated from a set of orthologous sequences, and are 
subsequently used to trace the evolutionary origin of individual 
TFBSs at the sequence level. The birth rate (a ) for a given motif 
represents the probability at which a TFBS appears at a single 
unoccupied site in a given year of evolutionary time. Similarly, the 
death rate {fi ) represents the rate at which an existing TFBS is lost 
per year. The method considers only TF motif counts within 
orthologous sequences across species, and therefore does not 
require an accurate base-to-base multiple sequence alignment. 



This framework allows us to reconstruct the ancestral states for 
each TFBS throughout the genome, providing a distribution for 
the branch of origin of the binding sites genome-wide. 

For any set of orthologous sequences across species and a known 
phylogeny, we first estimate birth and death rates according to the 
observed numbers of TF motif occurrences within each species. 
Such orthologous sequences can, for instance, be obtained using a 
genome-wide multiple species alignment. However, the underlying 
base-level alignment is ignored once the orthologous sequences are 
obtained, and subsequeiidy the model considers only the number 
of TF motifs within each sequence. Thus, the method operates 
independentiy of any details within the alignment once the 
sequence correspondence between species (i.e., orthologs) is 
obtained. Every node in the phylogeny is then associated with a 
(random) variable Qx, which represents the number of occurrences 
of the TFBSs at that node x. The value of Qx is known for each 
leaf node in the tree for any given ortholog set. Birth and death 
rates of a given motif can then be estimated by maximizing the 
likelihood across the entire data set, taking into account both 
branch lengths as well as the size of the sequence region (see 
Methods). Evolutionary rates were estimated using an iterative 
approach, but were found to be extremely robust according to the 
initial parameter settings. Once the birth and death rates are 
estimated using the fuU data set, we can use these rates to trace the 
branch of origin of individual TFBSs. This can be done by 
reconstructing the most likely ancestral state at each node of the 
phylogeny; i.e., the value of Qx that maximizes the likelihood of 
the data for each individual ChlP-seq peak region. This provides 
the most likely number of TF motif occurrences at each node, and 
allows us to trace the most likely branch of origin for individual 
site. The overall procedure of our method works as follows. (1) We 
identify motif occurrences within ChlP-seq peak regions in human 
for a given TF. (2) We estimate the likelihood for each ancestral 
node in the phylogeny given the motif occurrences in the 
descendant species. (3) We determine the branch of origin for 
the TF-bound motif within ChlP-seq peak regions. See Methods 
and Supplementary Methods in Text SI for details. 

The model framework and its motivations are illustrated in 
Figure 1. Figure lA shows one scenario where the binding site was 
introduced to the genome through transposable elements (TEs) 
insertion followed by point mutation, which is most likely branch 
of origin of this site under our model. Figure 1 B shows an example 
that our method is able to identify cases of TFBS turnover within 
stationary modules that might not otherwise be detected using 
human-mouse ChlP-seq data direct comparisons. In this genomic 
region, there is a human GATAl binding site originating on the 
ancestral primate lineage and a GATAl binding site specific to 
mouse and rat. Although the ChlP-seq peaks appear in the same 
location between human and mouse, our model can predict such 
lineage-specific events (which is also reflected in the cross-species 
alignment). Again, our algorithm predicted these branches of 
origin accurately without detailed alignment. 

We note that in addition to the robustness of the parameter 
estimates of a and , the branch of origin estimates were quite 
robust, since they were far more dependent on the number of sites 
in each species in each region (usually either 0 or 1) than the exact 
values of a and P . Although the loss of positional information 
would appear to make the approach insensitive to some cases of 
TFBS turnover, in which the creation of a new binding site 
coincides with the loss of an old binding site, in practice such cases 
are only problematic when applying the method to long branches 
in the phylogeny. For densely sampled phylogenies containing 
relatively short branch lengths, such turnover events can be 
inferred as long as the gain and loss occur on different branches. 
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Figure 1. Model framework for lineage-specific GATA1 binding sites. Multiple alignments are shown for two GATA1 -bound regions in 
humans. Red and blue boxes in the alignment correspond to GATA1 binding sites. Phylogenies illustrate the birth-death model framework, where the 
most likely number of binding sites is assigned to each ancestral node (denoted here as either 'present' or 'absent', at values 1 or 0 in this example). 
Highlighted branches denote the branch of origin. Evolutionary comparisons were conducted across ten primate species, as well as 36 non-primate 
vertebrates (not all are shown). (A) A binding site originating within an LTR insertion. (B) A genomic region containing a human GATA1 binding site 
originating along the ancestral primate lineage and a GATA1 binding site specific to mouse and rat. Despite nearly identical locations of the ChlP-seq 
peaks across human and mouse (in analogous Erythroblast cell lines), the ability of the method to identify specific branches of origin allows us to 
identify cases of TFBS turnover in close proximity. 
doi:10.1371/journal.pcbi.1003771.g001 
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This is usually the case, since old binding sites are seldom lost 
through selection and are generally lost slowly, following a nearly- 
neutral rate of decay [29,30]. Many neutrally (or near-neutrally) 
evolving sites are still present after relatively long periods of 
mammalian evolution [29]. 

Large number of TFBS embedded in ChlP-seq peaks of a 
particular TF exhibit increased evolutionary rates along 
lineages leading to human 

In this work, we applied our method to ChlP-seq data, which is 
now commonly used to map in vivo TF occupancy genome-wide 
[31]. We applied our method to ChlP-seq data sets for six TFs, 
namely GATAl, SOX2, MYC, MAX, ETSl, and CTCF [32- 
35]. These TFs were chosen, in part, for their diverse functional 
attributes, their well-documented binding motifs, and the avail- 
ability of ChlP-seq data in analogous cell types in human and 
mouse. Using our method, we can determine cases in which there 
are lineage-specific differences in evolutionary rates of a given 
motif along a particular branch in the j)li5'log(;ny. Since j)rcvious 
comparisons of ChlP-seq data from human and mouse have 
reported substantial divergence in protein-binding locations across 
the two species [5,6], ChlP-seq peaks in human are likely to 
contain a high enrichment of TFBSs compared to the orthologous 
regions in more distantly-related species. We thus hypotlu'sized 
that functional motifs present among ChlP-seq peak regions might 
be detectable by testing for an increased birth rate a along 
lineages ancestral to humans relative to other lineages, since any 
recently-acquired TFBSs in humans would naturally increase the 
birth rate along these lineages. 

To determine differences in tlie rate of motif evolution along 
specific lineages, we first assume a simple (nuU) model in which the 
birth and death rates (a and ^ ) remain constant across the entire 
phylogeny. We can then compare this hypothesis to a model in 
which birth and death rates differ along a single branch of the 
phylogeny relative to the other branches. The statistical signifi- 
cance of lineage-specific evolutionary rates can then be assessed 
using a likelihood-ratio test [36], producing a P-value reflecting the 
significance of lineage-specific differences in evolutionary rates 
along that branch (Supplementary Methods in Text SI). 

We apphed this approach to human ChlP-seq data generated 
for the six TFs, testing for increased birth rates within the (— 100,+ 
100) region relative to the summit of the peaks. Orthologous 
regions were then determined using 46-way multiz alignments 
from the UCSC Genome Browser [37], and analyses were 
conducted using data from all 46 vertebrate lineages according to 
their known phylogcii}-. For e\-ery TF, with the exception of MYC, 
the known binding motif of TF was predicted with a substantially 
increased birth rate along branches ancestral to humans (P<le- 
15). We note that in contrast to motif prediction using 
conservation-based approaches, our method generates motif 
predictions specifically using liiu'age-specific binding sites (or 
rather, their increased rate of creation along a specific lineage). For 
five of the six factors (GATAl, SOX2, MAX, CTCF, and ETSl), 
the documented binding motif of the TF produced the most 
statistically significant motif prediction using our method. The 
MYC binding motif, which has previously been noted for its strong 
patterns of conservation [27], was the only factor whose binding 
motif was not the top-ranked prediction, although it was stiU 
predicted under the P<le-15 threshold. For each factor, we used 
an iterative method to generate a Position Weight Matrix (PWM) 
according to the nucleotide composition at each site of the motif 
within the (— 100,+100) window of peaks in humans. These 
predicted PWMs very well matched with the known binding motifs 



as well as the results from the MEME suite [38] (Supplementary 
Methods in Text SI and Table SI). 

Substantial number of human TFBSs have recent origins 
after the human-mouse divergence 

Using our approach, we sought to determine the branch of 
origin for each human binding site for the six TFs. Each binding 
site was thus either inferred to be present in the common human- 
mouse ancestor, or a more recent lineage leading to human using 
the phylogeny shown in Figure 1 . The distribution of the branch 
of origin for each TFBS is shown in Figure 2. Notably, between 
~58-79% of all human TFBSs had inferred origins after the 
human-mouse split. 

In addition to estimating the fraction of ancestral binding sit(;s 
present in the human-mouse common ancestor, our approach 
allows us to estimate the age distribution across all binding site 
occurrences. As shown in Figure 2 (left panel), a sizeable fraction 
of binding sites in humans were estimated to have recent origins in 
primates. For instance, the fraction of human binding sites that are 
unique to hominids ranged from 16.1% for ETSl to 31.1% for 
MYC. Additionally, the number of sites estimated to be human- 
specific, i.e., with recent origins after the human-chimp diver- 
gence, ranged from 3.5% to 10%. 

Since the total number of protein-bound sites for each factor is 
quite large, these fractions represent a considerable number of sites 
unique to humans or closely-related primate lineages. The number 
of human binding sites originating since the common hominid 
ancestor ranged from 1,084 to 3,931 for each factor, including 440 
to 1,409 human-specific binding sites. The rate of appearance for 
new sites akmg different ancestral lineages showed a relatively 
stable rate of creation for new binding sites for most TFs up 
through the common Simian-primate ancestor (Figure 2, right 
panel), with slight increase in more recent branches. The birth rate 
of new sites ranges from 50-300 per million years for most TFs. 

Recent works have emphasized the emergence of new 
regulator)' elemc'iits via transposable elements (TEs) [7,39—41]. 
We assessed the amount of overlap between newly-derived TFBSs 
and annotated TEs determined by RepeatMasker [42] . Consistent 
with previous reports [7,43], a substantial fraction of binding sites 
with recent origin were located within TEs. The number of TE- 
derived sites varied between factors, ranging from approximately 
35-40% for recently-derived SOX2 and GATAl binding sites to 
approximately 15-20% for newly-derived ETSl and MYC 
binding sites. TE-derived TFBSs of specific branch of origins 
were associated with different TE families with different times of 
origin (Supplementary Results in Text SI and Figure SI). 

TFBS origin estimates correspond to human-mouse ChlP- 
seq data and cross-species alignments 

To assess the accuracy of the age estimates, we first compared 
our results to ChlP-seq data from human and mouse. Using 
analogous cell types across species, we determined the amount of 
overlap between human ChlP-seq peaks and ChlP-seq peaks in 
the orthologous regions in mouse. A human ChlP-seq peak was 
considered to be 'shared' with mouse if its summit was within 
200 bp of a mouse ChlP-seq peak summit in the orthologous 
region (note that the mouse ChlP-seq data were not the input our 
algorithm). The amount of overlap was assessed separately for 
regions containing a human binding site present in the common 
human-mouse ancestor and for regions that are not ancestral. 

We emphasize that, as illustrated previously in Figure IB, 
ChlP-seq peaks shared across human and mouse can often contain 
TFBSs that are genuinely lineage-specific, since ChlP-seq peaks 
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Figure 2. Time of origins for binding sites of six TFs in humans. Binding motifs were determined using human CInlP-seq data for GATA1, 
S0X2, MYC, CTCF, ETS1, and MAX. The branch of origin was determined for each binding site within the (-100,+! 00) region relative to a human ChlP- 
seq peak summit. (Left) Distribution of the branch of origin for each binding site. Branch labels correspond to those in Figure IB. 'Ancestral' binding 
sites have origins prior to human-mouse divergence. (Right) The rate of binding site creation along branches ancestral to humans. Rates were 
estimated by dividing the number of sites originating along each branch by evolutionary time, including only binding sites currently existing in 
humans. 

doi:1 0.1 371/journal.pcbi.1 003771 .g002 



span a relatively broad region and can contain instances of TFBS 
turnover within static modules. In addition, human-specific ChlP- 
seq peaks can also contain ancestral binding sites, since such sites 
can either be lost (non-conserved) along the mouse lineage or may 
not be bound by the TF along that lineage. 

Table 1 shows the amount of overlap in ChlP-seq peaks between 
human and mouse according to the estimated branch of origin of 
the TFBSs. Human peaks containing predicted ancestral TFBSs 
were far more likely to overlap with bound regions in mouse than 
peaks containing only predicted lineage-specific sites. Between 24— 
4 1 % of human peaks that overlapped with a peak for the same TF 
in mouse contained only predicted lineage-specific TFBSs, while 
59-76% of shared peaks contained a predicted ancestral TFBS. 
Thus, there was a clear enrichment for TFBSs predicted to be 
ancestral among the ChlP-seq peaks shared between human and 
mouse. Among human-specific ChlP-seq peaks, a substantially 
greater number contained only lineage-specific TFBSs than sites 
predicted to be ancestral to human and mouse. 

Although a relatively sizeable portion of shared ChlP-seq peaks 
contained only TFBSs predicted to be hiieage-specific, in the 



majority of cases (>90%) the mouse TFBS did not occur within in 
sequence region orthologous to the human peak region used, but 
was instead offset to a non-overlapping region within a mouse 
peak. Very few of these TFBSs actually aligned across the two 
species, compared with those with predicted ancestral origin. 

We compared the sequence level conservation of predicted 
TFBSs according to their branch of origins. We used the PhyloP 
mammalian conservation scores [44] available at the UCSC 
Genome Browser to determine the conservation level for TFBS in 
human. For a specific TF, we first computed the average PhyloP 
score (X) in each ChlP-seq peak and then calculated the average 
score (M) as well as standard deviation (SD) across all peaks in the 
genome. We then grouped the binding sites according to their 
branch of origin (in four groups: Hominid-specific, Simian- 
specific, Primate-specific, and Eutherian-specific) and calculated 
the average PhyloP score (X). Finally, we calculated the Z-score, 
i.e. (X-M)/SD). As expected, older binding regions show higher 
sequence level conservation than younger ones (Figure S2). These 
results suggest that our method can identify more recent, less- 
conserved TFBS, without relying on sequence-level conservation. 
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Additionally, to further demonstrate the effectiveness of our 
method in identifying conserved TFBS, we direcdy compared with 
methods that use phylogenetic footprinting approaches. We 
compared with phylogenetic footprinting methods at both element 
level (using MotifMap [45] which is based on the method used in 
[46-48]) and module level (using PReMod [12]). Overall, our 
method outperformed both MotifMap and PReMod (see Supple- 
mentary Results in Text SI, Table S3, and Figure S5). 

Within-species variation is higher among TFBSs of more 
recent origin 

Recent work has reported a substantial difference between 
genomic locations that are conserved across species versus those 
conser\'ed within the human population [26]. Thus, we compared 
both human variation data as well as sequence conservation across 
primates to the relative age of the TFBSs. Comparing the overall 
frequency of common SNPs in humans among TFBSs originating 
at different times of evolution showed that a substantial fraction of 
human-specific TFBSs contained common SNPs, comprising over 
6% of all human-specific TFBSs (Figure 3). This is much higher 
than the total fraction of TFBSs overlapping with a common SNP, 
at a median of less than 3% across all six factors. 

Since substantial variation exists in TF-binding events between 
human individuals [9], this high amount of variation among 
human-specific binding sites may partially reflect the fact that 
some TFBSs inferred to be human-specific may not be shared by 
the entire human population. However, recentiy-derived TFBSs in 
hominids were also substantially enriched for common SNPs, even 
when excluding human-specific TFBSs. For instance, among 
hominid-specific binding sites that are not human-specific, with a 
median of almost 4% of all sites. As these sites are shared across 
species, they cannot be fully explained by variation within the 
population. In contrast, common SNPs were consistentiy low 
among TFBSs with origins prior to hominids (Figure 3). Note that 
this observation was not biased by the SNP density surrounding 
the binding sites (Figure S3). 

Hominid-specific binding sites target specific biological 

processes 

To determine potential functions for the newly derived binding 
sites, we tested whether genes predicted to be targeted by binding 
sites with recent origins in hominids were involved in specific 
biological processes or pathways. Such enrichment was deter- 
mined for genes near hominid-specific binding sit(s compared to 
the total list of protein-bound sites for each factor, where each 
TFBS was mapped to the nearest TSS, up to a distance of 100 kb. 
This allowed us to assess potential lineage-specific functions of 
these sites relative to sites of more ancient origin. 

Genes located nearest to hominid-specific binding sites were 
more frequently enriched for neural and sensory-related functions, 
and were in many cases involved in neurological pathways (Table 
S2). CTCF, MYC, and SOX2 target gene sets were all enriched 
for GO categories involved in sensory perception, while GATAl, 
MYC, ETSl, and MAX were enriched for neural development 
and difiFerentiation categories. Among the six factors, neural- 
related functions are only well-documented for SOX2, which is 
involved in neuronal-cell maintenance [49,50] and whose 
hominid-specific target sites are enriched genes involved in sensory 
perception. Similarly, genes in proximity to hominid-specific 
binding sites for CTCF and MYC were enriched for sensory 
perception processes and pathways, particularly those related to 
olfaction, and in the case for MYC, hominid-specific target genes 
were also enriched for genes involved in synapse assembly and 



receptor clustering and binding. Hominid-specific binding sites for 
GATAl, most commonly known for its role in erythroid 
differentiation [51], were also found enriched near genes involved 
in axon extension of neural cells. For ETSl, hominid-specific 
binding sites were near genes involved in spinal cord neuron 
difiFerentiation, ventral spinal cord development, and behavioral 
fear response. We also found that the hominid-specific sites are 
near genes in different pathways as compared to Simian-specific 
sites and more ancestral sites (Table S4 and Table S5). 

Branches of origin of CTCF binding sites differ between 
cell type specific and ubiquitously bound sites 

ChlP-seq data of CTCF is available for several diflferent cell 
types, and thus we sought to determine whether CTCF-bound 
sites have distinct age distributions according to cell type. We thus 
expanded our analyses to include ChlP-seq data for CTCF 
collected from four different cell types in humans: B-lymphocytes 
(GM12878), embryonic stem ceUs (HlhESC), cerebeUum (HAC), 
and kidney cells (HRE). Interestingly, we observed substantial 
differences in the age distribution of cell-type specific and 
ubiquitously-bound sites. Figure 4 shows the age distribution for 
CTCF-bound sites, separated according to the number of cell 
types in which each site was bound. Only 43% of all sites bound by 
CTCF in all four cell types were primate-specific and 10% were 
specific to hominids. For cell-specific sites, i.e., those bound by 
CTCF in only one cell-type, these fractions increased to 63% and 
24%, respectively (Figure 4). 

These results are consistent with previous observations that cell- 
type specific CTCF binding sites arc less conserved across 
mammalian species than ul)ifjuitously-b()und site's [39,52]. How- 
ever, the recent origin of the cell-type specific TFBSs suggests that 
this cannot simply be explained by a relative lack of selective 
pressure, since cell-specific CTCF binding sites were frequentiy 
found to be absent in older lineages. Instead, there exists the 
possibility that cell type specific TFBSs might contribute to 
Kneage-specific regulatory function. 

Our method identified a TFBS turnover event within a 

functionally conserved enhancer 

Using our framework, we then utilized genome-wide chromatin 
data to search for potential functional consequences driven by 
birth or death of specific lineage-specific TFBS. We intersected the 
lineage-specific TFBSs with predicted human enhancer regions 
marked by ChromHMM model [53] as well as in vivo verified 
enhancers listed in the VISTA Enhancer Browser [54]. Figure 5 
shows a potential functional take-over through TFBS turnover 
inside an enhancer after human-mouse divergence. At the 
sequence level, two MAX binding sites were identified by our 
method with an ancestral one and a primate-specific binding site 
emerging after human-bushbaby split (Figure 5). Here these two 
MAX binding sites are also MYC binding sites since MAX and 
MYC have very similar motif (their ChlP-seq peaks overlap in 
Figure 5). The orthologous region of predicted primate-specific 
MAX/MYC binding site has no MAX or MYC ChlP-seq signal 
at all in mouse, which is consistent with our lineage-specific 
prediction. Since the young MAX/MYC binding site only locates 
1,700 bp upstream of the ancestral one and ChlP-seq intensity of 
ancestral binding site is much weaker in human compared to 
mouse, this is likely to be a turnover of MAX/MYC binding site 
within the enhancer. Then we asked whether the function of 
predicted enhancer was conserved between human and mouse. 
Interestingly, despite the potential turnover of MAX/MYC 
binding site in the sequence level, the mouse orthologous region 
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Figure 3. Within-species variation of binding sites according to time of origin. Boxplots show the fraction of TFBSs containing common 
SNPs in humans [72], where plots show the median (center line), upper- and lower-quartile (boxes), and range (whisker extremes) of percentages across 
the TFBSs of six TPs. TFBSs are categorized as human-specific, hominid-specific (not including human-specific sites), Simian primate-specific (not 
including hominid-specific sites), and ancestral (present in the human-mouse common ancestor). Overall fractions (including all sites) are shown in the 
left-most boxplot. Note the substantial rise in the amount of human variation within more recently derived binding sites compared to older sites. 
doi:1 0.1 371 /journal.pcbi.1 003771. g003 
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Figure 4. Time of origin for Kiuman CTCF binding sites according to cell-specificity. CTCF binding sites in humans were separated 
according to cell-specificity, considering four distinct cell lines (GM12878, HlhESC, HAC, and HRE). Colored bars correspond to varying amounts of 
cell-specificity, denoting sites bound in one, two, three, or in all four cell types (red to dark blue bars, respectively). Note the tendency for cell-specific 
binding sites to have more recent evolutionary origins than sites bound ubiquitously in all cell types. 
doi:l 0.1 371/journal.pcbi.l 003771 .g004 



of predicted enhancer was found to drive reproducible LacZ 
expression in El 1.5 mouse blood cell as demonstrated by in vivo 
transgenic mouse embryos essay based on VISTA Enhancer 
Browser, which confirms that the predicted enhancer also 
functions as an enhancer in mouse. It certainly remains to be 
solved whether this enhancer regulates the same genes in human 
and mouse and why the ChlP-seq signal on ancient MAX/MYC 
binding site is much weaker than the younger one in human. 
Nevertheless, this example demonstrates the ability of our method 
to compare functional level dynamics with sequence level 
difference in an evolutionary framework. 

Discussion 

The approach presented here represents an initial step towards 
understanding regulatory element evolution using in silico 
methods without relying on accurate cross-species alignments. 
Studies regarding the evolution of TFBSs are largely separable 
into those that emphasize cross-species conservation of regulatory 
elements [13,27,55-57] and studies highlighting the substantial 
divergence of transcription factor-binding events across species [5- 
7,58-60]. To some extent, this dichotomy may largely reflect 
differences between analyses conducted in silico and experiment- 
based studies. Although computational approaches have had some 
success in identifying cis-regulatory alterations responsible for 
changes in phenotype [16,17], the study of regulatory sequence 
evolution is limited by reliance upon multiple sequence alignments 
[20,21]. Reconstructing the ancestral states of regulatory sequenc- 
es is a particularly challenging probk'm; comparative studies of 
regulator)' ck-ments generally categorize sites as 'conserved' and 
'non-conserved' in terms of their presence across species 
[34,39,61]. 'Non-conserved' sites are thus often assumed to be 
under a weaker amount of purifying selection, indicating a relative 
lack of function [62]. However, any interpretation of the results is 
obscured by the fact that no distinction can be made between 
ancestral sites that have later been lost, v(;rsus sites of recent 
origins. Newly-derived functional elements, which are also 'non- 
conserved' by the common definition, may induce a gain-of- 
function trait, harboring the potential for lineage-specific adapta- 
tion or positive selection. It has long been argued that alterations 
in regulatory function are responsible for many, if not most, 
species-specific traits [1-3], and it is indeed these elements that are 
likely to contribute to phenotypic adaptation and the variation 
seen across species. 

In this context, the high fraction of TFBSs that have recent 
origins after human-mouse divergence is particularly notable. For 
all six factors analyzed, the majority of human TFBSs bound in 
vivo were originally absent in human-mouse common ancestor, 
which is consistent with previous cross-species comparisons noting 
substantial divergence in ChlP-seq protein-binding events across 
the two sp(;cies [5,6] and similar comparisons presented here 
(Table 1), and is also comparable to detailed analyses conducted in 
Drosophila using alternative approaches [57]. This does not 
appear to simply be a consequence of the specific selection of TFs, 
since binding motifs for several factors analyzed (CTCF, ETSl, 
MYC, MAX) have been previously documented as 'highly 
conserved' compared to motifs of other factors [27,39]. In 
particular, a comprehensive scan for conserved motifs identified 



the binding motifs of MYC and ETSl as the second and third- 
highest ranking motifs across all motifs in terms of conservation 
score across human, mouse, rat, and dog (where the ETS 1 binding 
motif was denoted as the ELKl motif) [27]. It is important to note 
that 'phylogenetic footprinting' approaches usually measure motif 
conservation relative to neutrally evolving elements in non-coding 
regions. The fact that such motifs are substantially more conserved 
compared to a neutral proxy does not mean that the majority of 
binding sites are conserved, nor does it imply that protein-bound 
sites considered collectively across the genome are more conserved 
than coding sequences under relatively weak selection. Since some 
of the most 'highly conserved' regulatory motifs are largely 
comprised by sites with recent origins, it is unlikely that this birth 
of new binding sites is simply a special property of a handful of 
TFs, but instead is likely to apply across many other factors. 
Nevertheless, some analysis challenges remain. For example, it is 
acknowledged that not all computationally predicted TFBSs 
within ChlP-seq peaks were actively bound by the TF and many 
ChlP-seq peaks may correspond to experimental noise. While 
researchers have continuously improved TFBS identification, 
high-throughput experimental approaches would be necessary to 
systematically validate binding site predictions, especially for 
lineage-specific ones. 

Among TFBSs in humans, a considerable amount of them are 
unique to hominids and are even human-specific. Since there are 
an estimated ~1700-1900 TFs in the human genome [63], if even 
a small fraction of these sites harbor important regulatory 
potential, the total number of human-specific functional binding 
sites genome-wide is quite large. Although not all TFBSs with 
recent origins will be responsible for lineage-specific traits, these 
results nonetheless offer the potential to understand adaptive 
evolution of gene regulation via the creation of new TFBSs, not 
only alone, but also in combination. In addition, a number of 
recent studies have highlighted differences in transcription factor 
binding within humans [9,64], and our results suggest that 
sequence-level variation of TFBSs within the population may be 
more common among more recently-derived binding sites. 

The functional categories of genes close to recendy derived 
binding sites may serve to shed new light upon adaptive traits 
obtained along the lineage leading to human. It is interesting that, 
despite substantial differences in the biological functions generally 
associated with the six TFs analyzed, genes near binding sites with 
recent origins are enriched for several sensory and neural related 
pathways and processes. We note that since the ChlP-seq data we 
used in this study were not derived from neuronal cells, further 
study is needed to more comprehensively understand the roles of 
hominid-specific sites in specific cell types. Nevertheless, identify- 
ing such lineage-specific regulatory elements not only provides 
potential insight on human biology, but may also provide new 
knowledge on the molecular mechanisms of human diseases. 

A natural future direction for this work would be to determine 
the specific regulatory effects of the recently derived TFBSs 
identified using this method. For instance, enrichment for within- 
species variation among recently derived binding sites raises the 
intriguing possibility that recently derived TFBSs most responsible 
for phenotypic differences across species are also the elements 
responsible for within-species variation. Future work will be 
necessary to demonstrate whether this is the case and the 
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Figure 5. A TFBS turnover event within a functionally conserved enhancer. A TFBS turnover event shows the impact of lineage-specific 
TFBS within an enhancer. The Genome Browser view shows the upstream of human gene EPB41 . VISTA Enhancer track and ChromHWIM track (orange 
means strong enhancer, yellow means weak enhancer) indicate a putative human enhancer. ChlP-seq signals of three TFs used in this study near 
predicted enhancer region are consistent with predicted lineage-specific binding site represented by 46-way multiple sequence alignment (only a 
subset of species are shown). Note that here the two MAX binding sites are also MYC binding sites since MAX and MYC have very similar motif. A 
potential TFBS turnover is observed between two predicted IVIAX/MYC binding sites (1700 bp apart). Different TFBSs are highlighted in different 
colors with MAX in blue and GATA1 in red. The predicted enhancer may function as blood cell specific enhancer in mouse, demonstrated by images 
of LacZ positive El 1.5 mouse transgenic embryo on the VISTA Enhancer Browser [54] (ID: mmSO; http://enhancer.lbl.gov/cgi-bin/imagedb3. 
pl?form = presentation&show = 1 &experiment_id = 80&organism_id = 2). 
doi:1 0.1 371/journal.pcbi.1 003771 .g005 
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differences in traits brought about by this variation. Also, our 
current model needs to be integrated with gene expression data to 
understand the interplay between cis-regulatory element evolution 
(e.g., binding site turnover and lineage-specific sites) and gene 
expression differences across different species [65-68] . The extent 
to which newly derived TFBSs operate primarily as cell-type 
specific elements with cell-specific regulatory fiinction also remains 
an open question. The mechanisms that contribute to cell-type 
specific TF binding, whether through the presence or absence of 
other protein factors, accessibility of DNA within the chromatin 
structure, or by other means, are also possible future directions 
that can be more fuUy understood using a combination of different 
types of data that are becoming available. 

Methods 

The model 

Our approach is designed to estimate genome-wide rates of 
evolution for a given motff according to a birth-death framework 
(formally, a quasi birth-death process [69]), similar to that used to 
measure the timing of accelerated motif evolution as in [29]. The 
birth rate (a) represents the rate at which a new motif occurrence 
appears at any unoccupied site per year, while the death rate (Ji) 
represents the rate at which an existing site is lost per year. Given a 
set of orthologous sequences and a known phylogeny, we estimate 
birth and death rates for the motif across the phylogenetic tree 
using a maximum likelihood approach. 

Let w(t) denote the probability that a given TFBS motif 
occupies that nucleotide position at time Z The probability that the 
motif will exist at time ^-|- 1 is then 

w(t+\) = a{\- H>(0) + ( 1 - lS)w{t) ( 1 ) 

Setting vfXO = w{t+l) — w{t) to be the rate of change of w 
with respect to t, Eq (1) gives the differential equation 

w'(t) = ix-((x. + p)w(t) (2) 

We denote the solutions to Eq (2) by u{t) and v{t) , where u(f) 
assumes that the motif was present at this site at time t= 0, while 
v(t) assumes that the motif did not exist at time / = 0 (i.e., initial 
conditions vv(0) = 1 and H'(0) = 0, respectively). As u(t) and 
v(/) are solutions for w{t) in Eq (2), both represent the 
probability that the motif exists at a specific nucleotide position 
after time t, differing only in the initial conditions. Solving Eq (2) 
gives 

uit)=—Ja + Pe-^'-'-+"^'], v(f)=^[l-e-<"+»'l (3) 

The transition probability Pij(t) is the conditional probability 
that a given region will contain j occurrences of the motif after 
time t, assuming i initial occurrences of the motff within the 
region. Namely, if a sequence initially contains i motif occurrenc- 
es, the probability [/,■_ k{() tiiat k of these occurrences remain after 
time / is given by the binomial distribution: 

= ^^3^ u(tf{\ - um-^ (4) 

Similarly, ff the width of our region is iV nucleotide sites, there 
are initially N — i unoccupied sites. Thus the probability 



Vn - i, a(0 that b of these unoccupied sites become occupied 
after time t also follows the binomial distribution: 

K.-a(0= (^^5.J-,,, v(0^(l-v(0f--^ (5) 

The transition probability Pij{t) that the given region contains j 
sites after time t is then given by 

min(iV) 

Pij{t)= Uij,{i)-Vi,.ij^u{t) (6) 

A; = 0 

Here, the sum is over all possible values k, where k represents the 
number of motif occurrences at time t among the sites that were 
originally occupied at time ?= 0. 

Calculating the likelihood of the data 

Given the birth and death rates (a and ji) across the tree (which 
are estimated using the method described below), we can calculate 
the likelihood of the data using Felsenstern's pruning algorithm 
[70]. Let us first consider data from a single sequence. We let 
0 = [a, P\ represent the parameter vector comprising the birth 
and death rates, and let Dy represent the data downstream of a 
node Y in the phylogeny. Let Z\, Z2, Z,„ be the daughter 
nodes of Y, occurring at times tzi, tz2, •••> relative to parent 
node F, respectively. 

If random variable Qy represents the number of motif 
occurrences at node Y, the likelihood xy(v,(i) = Pr(/)y| Qy = 
i; 9) of the data downstream of Y assuming / motif occurrences 
exist at node Y can be obtained recursively. This likelihood is 
given by 

xyii; 9) = ^;7i,(rz^)-xz^(/; 0) (7) 

J 

where the inner sum is across all possible values for j, 
corresponding to the number of motif occurrences at daughter 
node Zk. If node Z is a modern lineage, the probability Xz(j', 9) is 
equal to 1 if we actually observe j motif occurrences within the 
sequence, while the likelihood is zero otherwise. 

The likelihood of the data can therefore be obtained recursively 
by determining the values x(i; 9) progressively for each node 
farther up the tree. The log-likelihood L{D^ ',9 ) for a single 
sequence (the kth sequence) is then given by 

9) = log[Y,jP{jyxRU; 0}] (8) 

where R is the root node and P(j) is the prior probabiUty that j 
binding sites exist in a single sequence. For our implementation, 
prior probabilities P(j) were set to the Poisson distribution: 
F(j) = A e ^ ^ //! where / is the mean number of motif 
occurrences per sequence. The total log-likelihood L{D; 9) is then 
the sum L(Z);0) = Et= 1 ^{0^''^ -,9) across each of die n 
regions. 

Determining the optimal ancestral states 

We can determine the most likely ancestral states using the 
computed values for x{ j\ 9) aX each node in the phylogeny. At 
the root node R, the most likely ancestral state is the one that 
produces the highest likelihood; that is, the value of j that 
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maximizes the expression Qr = argmax,/'(/') • XR(j; 6) . Progres- 
sively moving down the tree, if the most likely number of motif 
occurrences at parent node Y is qy, the optimal number of motif 
occurrences qz at a daughter node Z is given by 

qz = argmax,xz(/'; dypgyjitz) (9) 
where tz is the branch length from node Y to node Z. 

Birth-death rate estimation 

Birth and death rates can be estimated using a maximum- 
likelihood approach. Namely, we use an EM-based approach [71] to 
iteratively optimize the lik(;lihood of the data D given the parami;ters 
6 = [a, /?]. We begin with an initial estimate 6 for the birth- 
death rates, generated by determining empirical birth-death rates 
after conducting ancestral reconstruction using parsimony (in our 
analysis, we found that the optimized parameters were not sensitive 
to the initial estimates). We determine the most likely ancestral state 
at each node given the initial parameter \'alues. We then determine 
the observed number of births and deaths according to these optimal 
ancestral states, providing new estimates for the birth and death rates 
9 = [a ,/^"*]- ^Ve then continue the process, using the 
prewous paramet(;r c-stimates 9 at each iteration to estimate the 
optimal ancestral states and obtain more optimal estimates of the 
birth and death rates until convergence (i.e., where 

1 1 0 0+ 1) - 0 W 1 1 falls below a certain threshold). 

Supporting Information 

Figure SI Fraction of binding sites overlapping trans- 
posable elements. Plots show the percentage of human TFBSs 
overlapping transposable elements (TEs) (y-axis), where binding 
sites are separated according to the branch of origin (x-axis). (A) 
Each colored plot corresponds to a single consensus motif; the 
fraction of binding sites overlapping TEs documented by 
RepeatMasker [42] are gi\cn as percentages along the y-axis. (B) 
Genome-wide prevalence of seven common TE families in 
humans. Fractions denote the total number of sites derived from 
each family across all TEs documented by RepeatMasker. (C) TE 
family composition for TE-derived TFBS according to the age of 
origin of SOX2, GATAl, and CTCF binding sites. The total 
height of each plot shows the total fraction of TFBS overlapping 
known TEs. Colored regions corresponding to the colored regions 
of Panel B denote the fraction of TFBS derived from each TE 
family according to the estimated age of the binding sites (x-axis). 
(TIF) 

Figure S2 PhyloP conservation vs. TFBS with different 
branch of origins. We used the PhyloP mammalian conserva- 
tion scores available at the UCSC Genome Browser to determine 
the sequence conservation level for TFBS with different branch of 
origins in human. X-axis shows TFBS with diflFerent branch of 
origins for four cUfFerent window sizes surrouncUng the peak 
summit. Y-axis shows the Z-score distribution for each group. For 
a specific TF, we first computed the average PhyloP score {X) in 
each ChIP-s(;q p(;ak and then calculated the average score (M) as 
well as standard de\'iation (SD) across all peaks in the genome. We 
then grouped the binding sites according to their branch of origin 
(in four groups: Hominid-specific, Simian-specific, Primate- 
specific, and Eutherian-specific) and calculated the average PhyloP 
score (X). Finally, we calculated the Z-score, i.e. {X-M)/SD. t- 
statistic from t-test between the youngest and the oldest for each 
group is also shown. 
(TIF) 



Figure S3 Background SNP density for TFBSs with 
different branch of origin. TFBSs were grouped into different 
branches of origin (X-axis). To calculate the background SNP 

density surrounding these TFBSs, we extended 1 kb, .500 bp, or 
125 bp to both directions (i.e., 2k, Ik, or 250 bp window) and 
counted the number of common SNPs in this 2 kb window. The 
figure shows that there are no significant differences of SNP 
density surrounding the TFBSs with different branches of origin. 
(TIF) 

Figure S4 Positive enhancer rate based on the VISTA 
enhancer database. The positive rate is the percentage of 
human enhancers that also show enhancer activity in mouse. The 
expected rate is based on all the enhancers overlapping with TFBS 
used in our six TF data sets. 'More ancestral enhancer' has higher 
positive enhancer rate compared with 'more lineage-spetdfic 
enhancers' or 'neutral enhancer', which is generally consistent 
with our computational prediction that the ancestral TFBS are 
more functionally conserved than lineage-specific TFBS, even 
though this comparison dataset is not ideal. See Supplementary 
Results in Text SI for details. 
(TIF) 

Figure S5 Comparison between our method and Motif- 
Map. A receiver operating characteristic (ROC) curve shows the 
prediction power between our method and MotifMap. ROC 
curves for MotifMap were generated using diflFerent BBLS 
thresholds (ranging from zero to the maximum possible BBLS 
score here, 4.73) to call a TFBS as a conserved one. In our 
method, we tested two shift sizes, +/ — 15 bp (light blue) and +/ — 
30 bp (dark blue). The results from MotifMap were based on +/ — 
15 bp shift size (magenta). See Supplementary Results in Text SI 
for detailed explanation of the comparison method and how the 
benchmark dataset was constructed. 
(TIF) 

Figure S6 Sensitivity of our method when we add noisy 
sites in leaf nodes. Sensitivity of our method on the uncertainty 
of number of binding sites in leaf nodes was determined by 

randomly deleting/adding 5% sites in +/—100 bp of peak summit 
in all the species except human. Predictions were compared with 
original results and the changes of branch of origin for each 
binding site were counted. Y-axis shows the percentage of binding 
sites with various branch change. 0 means the prediction of branch 
of origins remain same before and after we randomly delete or 
insert binding sites. 
(TIF) 

Figure S7 Sensitivity of our method when we change the 
branch lengths. Sensitivity of our method on the branch length 
of phylogenetic tree was characterized by randomly changing the 
branch lengths to a certain extent. The length of each branch in 
phylogenetic tree can be varied in certain range relative to its 
original length (shown on X-axis). Y-axis shows the percentage of 
binding sites that have different branch of origin before and after 
we randomly change branch lengths in the phylogenetic tree. 
(TIF) 

Table SI Comparison of consensus motifs. 

(PDF) 

Table S2 Gene functions and pathways associated with 
hominid-specific TFBS. 

(PDF) 

Table S3 Performance comparison with MotiflMap and 
PReMod. 

(PDF) 
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Table S4 Gene functions and pathways associated with 
simian-specific TFBS. 

(PDF) 

Table S5 Gene functions and pathways associated with 
ancestral TFBS. 

(PDF) 

Text SI Supplementary methods and supplementary 
results. 

(PDF) 
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